install.packages("knitr")
install.packages("forecast", repos = "http://cran.us.r-project.org")
install.packages("lmtest", repos = "http://cran.us.r-project.org")
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset consists of 12 variables and 1,599 observations. The quality variable is discrete; the others are continuous.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Red wine quality is approximately normally distributed and concentrated around 5 and 6.
The distribution of fixed acidity is right skewed and concentrated around 7.9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
It is unclear whether the distribution of volatile acidity is bimodal or unimodal, right skewed or normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric acid is not normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual sugar is right skewed, and concentrated around 2. There are a few outliers in the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is normal, and concentrated around 0.08. The plot has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free sulfur dioxide is right skewed and concentrated around 14.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total sulfur dioxide is right skewed and concentrated around 38. There are a few outliers in the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The distribution of density is normal and concentrated around 0.9967.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The distribution of pH is normal and concentrated around 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is right skewed and concentrated around 0.6581. The plot has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol is right skewed and concentrated around 10.20.
We divide the data into two groups: the high quality group contains observations whose quality is 7 or 8, and the low quality group contains observations whose quality is 3 or 4. After examining the difference in each feature between the two groups, we see that volatile acidity, density, and citric acid may have some correlation with quality. Let’s visualize the data to see the difference.
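The group comparison described above can be sketched roughly as follows. This is a minimal illustration, not the original analysis code: the `wine` data frame here is a small synthetic stand-in with made-up values, and the names `high`, `low`, and `mean_diff` are hypothetical.

```r
# Synthetic stand-in for the wine data (values are invented for illustration).
wine <- data.frame(
  quality          = c(3, 4, 4, 5, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.90, 0.80, 0.70, 0.60, 0.50, 0.55, 0.40, 0.35, 0.30),
  citric.acid      = c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50)
)

# Split into the two groups: quality 7 or 8 (high) and quality 3 or 4 (low).
high <- subset(wine, quality >= 7)
low  <- subset(wine, quality <= 4)

# Difference in feature means (high minus low); quality itself is excluded.
features  <- setdiff(names(wine), "quality")
mean_diff <- colMeans(high[features]) - colMeans(low[features])
print(round(mean_diff, 3))
```

A large difference in means relative to a feature's spread suggests that the feature is worth plotting against quality.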
There are 1,599 red wines in the dataset, with 11 features on the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) plus a quality rating.
Other observations:
The median quality is 6. Most wines have a pH of 3.4 or lower (the third quartile is 3.40). About 75% of wines have a quality of 6 or lower. The median percent alcohol content is 10.20 and the maximum percent alcohol content is 14.90.
The main features in the data set are pH and quality. I’d like to determine which features are best for predicting the quality of a wine. I suspect pH and some combination of the other variables can be used to build a predictive model to grade the quality of wines.
investigation into your feature(s) of interest?
Volatile acidity, citric acid, and alcohol likely contribute to the quality of a wine. After researching information on wine quality, I think volatile acidity (the amount of acetic acid in wine) and alcohol (the percent alcohol content of the wine) probably contribute the most.
I created a new variable called “quality.level”, an ordered categorical variable with the levels “low”, “average”, and “high”. This grouping makes it easier to detect differences among the groups.
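One way such a variable could be built is with `cut()`. The sketch below assumes the cut points implied elsewhere in this report (3 or 4 is low, 5 or 6 is average, 7 or 8 is high); the short `quality` vector is made up for illustration and is not the real data.

```r
# Illustrative quality scores (not the real data).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Bin the scores into an ordered factor. Intervals are right-closed by
# default, so the bins are (0, 4], (4, 6], and (6, 10].
quality.level <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "average", "high"),
  ordered_result = TRUE
)

print(table(quality.level))
```

Using an ordered factor keeps the levels in a meaningful order in plots and grouped summaries.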
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
After visualizing the citric acid and volatile acidity data, I observed some unusual distributions, which I suspect may be related to the quality of red wine. Since the data is clean, I did not perform any cleaning or modification of the data.
The graph shows a very clear trend: the lower the volatile acidity, the higher the quality. The correlation coefficient between quality and volatile acidity is -0.39. This can be explained by the fact that excessively high levels of volatile acidity can lead to an unpleasant, vinegar-like taste.
## # A tibble: 3 x 7
## quality.level count median mean variance Q1 Q3
## <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 high 217 0.4 0.376 0.0378 0.3 0.49
## 2 average 1319 0.24 0.258 0.0353 0.09 0.4
## 3 low 63 0.08 0.174 0.0430 0.02 0.27
The correlation coefficient is 0.226; the graph shows a weak positive relationship between quality level and citric acid.
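A grouped summary of this kind can be produced in base R with `table()` and `tapply()`. This is only a sketch on a tiny synthetic data frame with invented values; the summary table in this report appears to come from a tibble-based workflow, and `by_level` is a hypothetical name.

```r
# Synthetic stand-in with an ordered quality level (values are invented).
wine <- data.frame(
  quality.level = factor(
    c("low", "low", "average", "average", "average", "high", "high"),
    levels = c("low", "average", "high"),
    ordered = TRUE
  ),
  citric.acid = c(0.02, 0.10, 0.20, 0.26, 0.30, 0.38, 0.44)
)

# Count, median, and mean of citric acid within each quality level.
by_level <- data.frame(
  count  = as.vector(table(wine$quality.level)),
  median = as.vector(tapply(wine$citric.acid, wine$quality.level, median)),
  mean   = as.vector(tapply(wine$citric.acid, wine$quality.level, mean)),
  row.names = levels(wine$quality.level)
)
print(by_level)
```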
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
With a correlation coefficient of 0.476, the graph shows a positive relationship between alcohol and quality level. Average quality and low quality wines have their percent alcohol contents concentrated around 10, whereas high quality wines have their percent alcohol contents concentrated around 12.
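For reference, a Pearson test like the one printed above can be run with `cor.test()`. The sketch below uses simulated data, so its numbers will not match the report; the slope, noise level, and sample size are arbitrary assumptions, and only the alcohol range (8.4 to 14.9) is taken from the summary shown earlier.

```r
set.seed(42)  # make the simulation reproducible

# Simulated data: quality rises with alcohol, plus random noise.
n       <- 200
alcohol <- runif(n, min = 8.4, max = 14.9)
quality <- 2 + 0.36 * alcohol + rnorm(n, sd = 0.7)

# Pearson product-moment correlation test.
result <- cor.test(quality, alcohol, method = "pearson")

print(result$estimate)  # sample correlation coefficient
print(result$conf.int)  # 95 percent confidence interval
print(result$p.value)   # p-value for H0: true correlation equals 0
```

The test's degrees of freedom equal the number of observations minus 2, which is why the output above reports df = 1597 for 1,599 wines.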
A weak negative correlation of -0.2 exists between percent alcohol content and volatile acidity.
The correlation coefficient is 0.04, which indicates that there is almost no relationship between residual sugar and percent alcohol content. However, if we examine the winemaking process, we see a global trend toward wines made from ripe to overly ripe grapes. To keep such wines from being too sweet, fermentation has to continue until more of the sugar is consumed, and as a byproduct more alcohol is present in the wine.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
There is a moderate negative correlation (r = -0.55) between citric acid and volatile acidity.
The correlation coefficient between alcohol and density is -0.5, so the relationship is quite clear. As percent alcohol content increases, the density decreases. The reason is simple: the density of alcohol (ethanol) is lower than the density of water.
This graph shows a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity.
investigation. How did the feature(s) of interest vary with other features in
the dataset?
I observed a negative relationship between quality level and volatile acidity, and a positive relationship between quality level and alcohol. I am not surprised by this result, because people tend to grade stronger wines as high quality, whereas wines with a low percent alcohol content are often not graded as such. High volatile acidity is also perceived as undesirable because it affects the taste of wine. Alcohol and volatile acidity do not show a clear relationship with each other.
(not the main feature(s) of interest)?
Yes, I observed a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity. The other variables show either a very weak relationship or none at all.
Quality is most strongly correlated with alcohol (r = 0.476, a moderate positive correlation), and it is also negatively correlated with volatile acidity (r = -0.39). Alcohol and volatile acidity could be used in a model to predict the quality of wine.
The densities of high quality wines are concentrated between 0.994 and 0.998, and in the lower range of volatile acidity (y axis).
## [1] "Percent alcohol contents by quality level:"
## # A tibble: 3 x 3
## quality.level mean sd
## <ord> <dbl> <dbl>
## 1 high 11.5 0.998
## 2 average 10.3 0.972
## 3 low 10.2 0.918
## [1] "Volatile acidities by quality level:"
## # A tibble: 3 x 3
## quality.level mean sd
## <ord> <dbl> <dbl>
## 1 high 0.406 0.145
## 2 average 0.539 0.168
## 3 low 0.724 0.248
High quality seems to be associated with alcohol ranging from 11 to 13, volatile acidity from 0.2 to 0.5, and citric acid from 0.25 to 0.75.
The distribution of low and average quality wines seems to be concentrated at fixed acidity values between 6 and 10. pH increases as fixed acidity decreases, and citric acid increases as fixed acidity increases.
##
## Call:
## lm(formula = quality ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
The density curve for high quality wines is distinct from the others, and is mostly distributed between 11 and 12.
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.56575 0.05791 113.39 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16
This chart shows a very clear trend: as volatile acidity decreases, the quality of wine increases. Wines with volatile acidity exceeding 1 are almost always rated as low quality. The linear model of volatile acidity has an R-squared of 0.152, which means this feature alone does not explain much of the variability of red wine quality.
##
## Call:
## lm(formula = quality ~ volatile.acidity + alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59342 -0.40416 -0.07426 0.46539 2.25809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.09547 0.18450 16.78 <2e-16 ***
## volatile.acidity -1.38364 0.09527 -14.52 <2e-16 ***
## alcohol 0.31381 0.01601 19.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared: 0.317, Adjusted R-squared: 0.3161
## F-statistic: 370.4 on 2 and 1596 DF, p-value: < 2.2e-16
R-squared roughly doubles (from 0.153 to 0.317) after adding alcohol to the volatile acidity model.
Quality has a weak positive relationship with alcohol and a weak negative relationship with volatile acidity. The R-squared values are low but the p-values are significant; this indicates that the regression models contain significant variables but explain little of the variability. The quality of wine does not depend solely on volatile acidity and alcohol, but also on other features. Therefore, it is hard to build a model that accurately predicts the quality of wine from these two features alone.
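The three-model comparison can be sketched as follows. The data here is simulated, with coefficients only loosely inspired by the fitted values above, so the R-squared figures it prints are illustrative and will not reproduce the report's numbers. The names `m1`, `m2`, `m3`, and `r2` are hypothetical.

```r
set.seed(1)  # make the simulation reproducible

# Simulated predictors over the ranges seen in the summaries above.
n <- 300
wine <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Simulated response: the coefficients and noise level are assumptions.
wine$quality <- 3 + 0.3 * wine$alcohol -
  1.4 * wine$volatile.acidity + rnorm(n, sd = 0.7)

# Fit the two single-predictor models and the combined model.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- lm(quality ~ volatile.acidity, data = wine)
m3 <- lm(quality ~ volatile.acidity + alcohol, data = wine)

# Collect R-squared for each model to compare explained variance.
r2 <- sapply(
  list(alcohol = m1, volatile.acidity = m2, both = m3),
  function(m) summary(m)$r.squared
)
print(round(r2, 3))
```

Because the single-predictor models are nested in the combined model, the combined R-squared can never be lower than either of them; adjusted R-squared is the fairer comparison when predictors are added.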
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid.
Residual sugar, which is supposed to play an important part in wine taste, actually has very little impact on wine quality.
strengths and limitations of your model.
Yes, I created 3 models. Their p-values are significant; however, the R-squared values are under 0.4, so the models explain only a small share of the variability in quality.
The distribution of red wine quality appears to be approximately normal. 82.5% of wines are rated 5 or 6 (average quality). Although the rating scale runs from 0 to 10, no wine is rated 0, 1, 2, 9, or 10.
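The 82.5% figure can be checked directly from the group counts reported in the quality level summary earlier (63 low, 1,319 average, 217 high). A short sketch, where `counts` and `shares` are hypothetical names:

```r
# Group counts taken from the quality level summary table in this report.
counts <- c(low = 63, average = 1319, high = 217)

# Share of each quality level as a percentage of all 1,599 wines.
shares <- round(100 * counts / sum(counts), 1)
print(shares)
```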
While citric acid does not have a strong correlation with quality, it is an important component in the quality of wine. Because citric acid is an organic acid that contributes to the total acidity of a wine, it is crucial to have the right amount of it. Adding citric acid gives the wine a “freshness” that is otherwise not present and makes the wine more acidic. Wines with citric acid exceeding 0.75 are rarely rated as high quality. 50% of high quality wines have a relatively high citric acid level, between 0.3 and 0.49, whereas average and low quality wines have lower amounts of citric acid.
Quality moves in opposite directions along the two variables: it rises with percent alcohol content and falls with volatile acidity. Wine with a high percent alcohol content and low volatile acidity tends to be rated as high quality. Based on this result, we can see that volatile acidity and percent alcohol content are two important components in the quality and taste of red wines.
The wines data set contains information on 1599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables and tried creating 3 linear models to predict red wine quality.
There was a trend between the volatile acidity of a wine and its quality. There was also a trend between the alcohol content and quality. For the linear models, all wines were included, since information on quality, volatile acidity, and alcohol was available for all the wines. The third linear model, with 2 variables, was able to account for 31.6% of the variance in quality.
There are very few wines that are rated as low or high quality. We could improve the quality of our analysis by collecting more data and creating more variables that may contribute to the quality of wine. This would likely improve the accuracy of the prediction models. Having said that, we have successfully identified features that impact the quality of red wine, visualized their relationships, and summarized their statistics.